High-Recall Document Retrieval from Large-Scale Noisy Documents via Visual Analytics based on Targeted Topic Modeling
Authors
Abstract
We present a visual analytics system for large-scale, high-recall document retrieval tasks, where missing any relevant document can be critical. Our system uses a novel user-driven topic modeling approach called targeted topic modeling, a variant of nonnegative matrix factorization (NMF). The system visualizes a topic summary as a treemap and lets users keep relevant topics and incrementally remove uninteresting ones in the treemap view without losing potentially relevant documents.
Similar resources
VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data
We present a visual analytics system called VisIRR, which is an interactive visual information retrieval and recommendation system for document discovery. VisIRR effectively combines both paradigms of passive pull through query processes for retrieval and active push that recommends items of potential interest based on user preferences. Equipped with efficient dynamic query interfaces...
Visual Analytics for Interactive Exploration of Large-scale Document Data via Nonnegative Matrix Factorization
Due to an ever-increasing amount of document data and the complexities involved in their analyses that will reveal meaningful insights, it is crucial to guide users in their decision-making processes using advanced methods that are both interactive and information preserving. Numerous computational approaches from machine learning, data mining, information retrieval, and natural language process...
iVisClustering: An Interactive Visual Document Clustering via Topic Modeling
Clustering plays an important role in many large-scale data analyses providing users with an overall understanding of their data. Nonetheless, clustering is not an easy task due to noisy features and outliers existing in the data, and thus the clustering results obtained from automatic algorithms often do not make clear sense. To remedy this problem, automatic clustering should be complemented ...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Improving Retrievability and Recall by Automatic Corpus Partitioning
With increasing volumes of data, much effort has been devoted to finding the most suitable answer to an information need. However, in many domains, the question of whether any specific information item can be found at all via a reasonable set of queries is essential. This concept of Retrievability of information has evolved into an important evaluation measure of IR systems in recall-oriented appl...